PRINCIPAL MANIFOLDS AND NONLINEAR DIMENSION 
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Abstract. Nonlinear manifold learning from unorganized data points is a very challenging 
unsupervised learning and data visualization problem with a great variety of applications. In this 
paper we present a new algorithm for manifold learning and nonlinear dimension reduction. Based 
on a set of unorganized data points sampled with noise from the manifold, we represent the local 
geometry of the manifold using tangent spaces learned by fitting an afiine subspace in a neighborhood 
of each data point. Those tangent spaces are aligned to give the internal global coordinates of the 
data points with respect to the underlying manifold by way of a partial eigendecomposition of the 
neighborhood connection matrix. We present a careful error analysis of our algorithm and show that 
the reconstruction errors are of second-order accuracy. We illustrate our algorithm using curves and 
surfaces both in 2D/3D and higher dimensional Euclidean spaces, and 64-by-64 pixel face images 
with various pose and lighting conditions. We also address several theoretical and algorithmic issues 
for further research and improvements. 
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1. Introduction. Many high-dimensional data in real- world applications can be 
modeled as data points lying close to a low-dimensional nonlinear manifold. Discov- 
ering the structure of the manifold from a set of data points sampled from the mani- 
fold possibly with noise represents a very challenging unsupervised learning problem 
1|, H, ^, H, ^, |l^, |l^, 14, 17, The discovered low-dimensional structures can 



be further used for classification, clustering, outlier detection and data visualization. 
Example low-dimensional manifolds embedded in high-dimensional input spaces in- 
clude image vectors representing the same 3D objects under different camera views 
and lighting conditions, a set of document vectors in a text corpus dealing with a 
specific topic, and a set of 0-1 vectors encoding the test results on a set of multiple 



choice questions for a group of students |1^, 14, The key observation is that the 
dimensions of the embedding spaces can be very high (e.g., the number of pixels for 
each images in the image collection, the number of terms (words and/or phrases) in 
the vocabulary of the text corpus, or the number of multiple choice questions in the 
test), the intrinsic dimensionality of the data points, however, are rather limited due 
to factors such as physical constraints and linguistic correlations. Traditional dimen- 
sion reduction techniques such as principal component analysis and factor analysis 
usually work well when the data points lie close to a linear (afhne) subspace in the 
input space 0. They can not, in general, discover nonlinear structures embedded in 
the set of data points. 

Recently, there have been much renewed interests in developing efficient algo- 
rithms for constructing nonlinear low-dimensional manifolds from sample data points 
in high-dimensional spaces, emphasizing simple algorithmic implementation and avoid- 
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ing optimization problems prone to local minima |l8|. Two lines of research of 
manifold learning and nonlinear dimension reduction have emerged: one is exempli- 
fied by |, H H where pairwise geodesic distances of the data points with respect to 
the underlying manifold are estimated, and the classical multi-dimensional scaling is 
used to project the data points into a low-dimensional space that best preserves the 
geodesic distances. Another line of research follows the long tradition starting with 
self-organizing maps (SOM) principal curves/surfaces Q and topology-preserving 
networks The key idea is that the information about the global structure of a 
nonlinear manifold can be obtained from a careful analysis of the interactions of the 
overlapping local structures. In particular, the local linear embedding (LLE) method 
constructs a local geometric structure that is invariant to translations and orthogonal 
transformations in a neighborhood of each data points and seeks to project the data 
points into a low-dimensional space that best preserves those local geometries u4, 16l . 

Our approach draws inspiration from and improves upon the work in n% Hq] 
which opens up new directions in nonlinear manifold learning with many fundamen- 
tal problems requiring to be further investigated. Our starting point is not to con- 
sider nonlinear dimension reduction in isolation as merely constructing a nonlinear 
projection, but rather to combine it with the process of reconstruction of the non- 
linear manifold, and we argue that the two processes interact with each other in a 
mutually reinforcing way. In this paper, we address two inter-related objectives of 
nonlinear structure finding: 1) to construct the so-called principal manifold ^ that 
goes through "the middle" of the data points; and 2) to find the global coordinate 
system (the natural parametrization space) that characterizes the set of data points 
in a low-dimensional space. The basic idea of our approach is to use the tangent space 
in the neighborhood of a data point to represent the local geometry, and then align 
those local tangent spaces to construct the global coordinate system for the nonlinear 
manifold. 

The rest of the paper is organized as follows: in section ^, we formulate the 
problem of manifold learning and dimension reduction in more precise terms, and 
illustrate the intricacy of the problem using the linear case as an example. In section 
^, we discuss the issue of learning local geometry using tangent spaces, and in section ^ 
we show how to align those local tangent spaces in order to learn the global coordinate 
system of the underlying manifold. Section |^ discusses how to construct the manifold 
once the global coordinate system is available. We call the new algorithm local tangent 
space alignment (LTSA) algorithm. In section ^, we present an error analysis of LTSA, 
especially illustrating the interactions among curvature information embedded in the 
Hessian matrices, local sampling density and noise level, and the regularity of the 
Jacobi matrix. In section ^, we show how the partial eigendecomposition used in global 
coordinate construction can be efficiently computed. We then present a collection of 
numerical experiments in section ^j. Section ^ concludes the paper and addresses 
several theoretical and algorithmic issues for further research and improvements. 

2. Manifold Learning and Dimension Reduction. We assume that a d- 
dimensional manifold J- embedded in an m-dimensional space {d < m) can be repre- 
sented by a function 

/:Cc7^''^i^", 

where C is a compact subset of TZ'^ with open interior. We are given a set of data 
points xi, • • • , xjv, where Xi e TZ"^ are sampled possibly with noise from the manifold. 
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i.e., 

= fin) + Ei, i = l,...,N, 

where represents noise. By dimension reduction we mean the estimation of the 
unknown lower dimensional feature vectors t^'s from the li's, i.e., the x^'s which are 
data points in TZ"^ is (nonlinearly) projected to r^'s which are points in TZ"^, with 
d < m we realize the objective of dimensionality reduction of the data points. By 
manifold learning we mean the reconstruction of / from the x^'s, i.e., for an arbitrary 
test point r € C C TZ'^, we can provide an estimate of /(r). These two problems 
are inter-related, and the solution of one leads to the solution of the other. In some 
situations, dimension reduction can be the means to an end by itself, and it is not 
necessary to learn the manifold. In this paper, however, we promote the notion that 
both problems are really the two sides of the same coin, and the best approach is 
not to consider each in isolation. Before we tackle the algorithmic details, we first 
want to point out that the key difficulty in manifold learning and nonlinear dimension 
reduction from a sample of data points is that the data points are unorganized, i.e., no 
adjacency relationship among them are known beforehand. Otherwise, the learning 
problem becomes the well-researched nonlinear regression problem (for a more detailed 
discussion, see Q where techniques from computational geometry was used to solve 
error- free manifold learning problems). To ease discussion, in what follows we will 
call the space where the data points live the input space, and the space into which 
the data points are projected the feature space. 

To illustrate the concepts and problems we have introduced, we consider the 
example of linear manifold learning and linear dimension reduction. We assume that 
the set of data points are sampled from a c?-dimensional affine subspace, i.e., 

Xi^c+Un + ei, i = l,...,N, 

where c £ TZ"\Ti £ TZ'^ and £ TZ™" represents noise. U £ JZ"^^'^ is a matrix forms 
an orthonormal basis of the afhne subspace. Let 

X ^ [xi, ■ ■ ■ , xn], T = [ti, • • • , Tat], E ^ [ei, ■ ■ ■ , En]. 

Then in matrix form, the data-generation model can be written as 

X = ce'^ + UT + E, 

here e is an iV-dimensional column vector of all ones. The problem of linear manifold 
learning amounts to seeking c, U and T to minimize the reconstruction error E, i.e., 

min||£;|| = min \\X - (ce'^ + UT)\\f, 

II II ^^^rp II \ /II 

where || • ||f stands for the Frobenius norm of a matrix. This problem can be read- 
ily solved by the singular value decomposition (SVD) based upon the following two 
observations. 

1) The norm of the error matrix E can be reduced by removing the mean of the 
columns of E from each column of E, and hence one can assume that the optimal E 
has zero mean. This requirement can be fulfilled if c is chosen as the mean of X, i.e., 
c = Xe/N=x. 
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2) The low-rank matrix UT is the optimal rank-d approximation to the centered 
data matrix X — xe^. Hence the the optimal solution is given by the SVD of X — xe^ , 

x-xe^ = Q^v^, Pen"'''"', se7^'"^^, v&n^''^, 

i.e., UT — Qd^dVj , where = diag(cri, • • • , ad) with the d largest singular values 
of X — xe^ , Qd and Vd are the matrices of the corresponding left and right singular 
vectors, respectively. The optimal U* is then given by Qd and the learned linear 
manifold is represented by the linear function 

/(r) =x + U*T. 

In this model, the coordinate matrix T corresponding to the data matrix X is given 

by 

T = iU*f{X - xe^) = diag(ai, . . . , <7d)Vj . 
Ideally, the dimension d of the learned linear manifold should be chosen such that 

The function / is not unique in the sense that it can be reparametrized, i.e., 
the coordinate can be replaced by f with a global afline transformation r = Pf , if 
we change the basis matrix U* to U*P. What we are interested in with respect to 
dimension reduction is the low-dimensional representation of the linear manifold in the 
feature space. Therefore, without loss of generality, we can assume that the feature 
vectors are uniformly distributed. For a given data set, this amounts to assuming 
that the coordinate matrix T is orthonormal in row, i.e., TT'^ = Id. Hence we we can 
take T = Vj and the linear function is now the following 

/(t) = x + U* diag(CTi, . . . , ad)T. 

For the linear case we just discussed, the problem of dimension reduction is solved 
by computing the right singular vectors Vd, and this can be done without the help of 
the linear function /. Similarly, the construction of the linear function / is done by 
computing U* which are just the d largest left singular vectors oi X — xe^ . 

The case for nonlinear manifolds is more complicated. In general, the global 
nonlinear structure will have to come from local linear analysis and alignment ^ . 
In , local linear structure of the data set are extracted by representing each point 
Xi as a weighted linear combination of its neighbors, and the local weight vectors 
are preserved in the feature space in order to obtain a global coordinate system. 
In a linear alignment strategy was proposed for aligning a general set of local 
linear structures. The type of local geometric information we use is the tangent 
space at a given point which is constructed from a neighborhood of the given point. 
The local tangent space provides a low-dimensional linear approximation of the local 
geometric structure of the nonlinear manifold. What we want to preserve are the 
local coordinates of the data points in the neighborhood with respect to the tangent 
space. Those local tangent coordinates will be aligned in the low dimensional space 
by different local aflinc transformations to obtain a global coordinate system. Our 
alignment method is similar in spirit to that proposed in [ll7| . In the next section we 
will discuss the local tangent space and global alignment which will then be applied 
to data points sampled with noise in Section 0. 
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3. Local Tangent Space and Its Global Alignment. We assume that is a 
d-dimensional manifold in a m-dimensional space with unknown generating function 
/(r), r G R'^, and we are given a data set consists of TV m-dimensional vectors 
X = [xi, . . . ,xn], Xi ^ TZ™ generated from the following noise- free model, 



fin), 



1, 



where Ti € TZ'^ with d < m. The objective as we mentioned before for nonlinear 
dimension reduction is to reconstruct Ti's from the corresponding function values 
/(Ti)'s without explicitly constructing /. Assume that the function / is smooth 
enough, using first-order Taylor expansion at a fixed r, we have 



(3.1) 



/(r) = /(r) + J/(r) . (f - r) + 0(||f - rf ), 



where J/(t) £ j^rnxd -j-j^g Jacobi matrix of / at r. If we write the m components of 
/(r) as 



f{r) 



h{r) 



fmir) 



then J/(t) 



dfi/dn 



dfrn/dri 



dfi/du 



dfm/dTd 



The tangent space 7^ of / at r is spanned by the d column vectors of J/(t) and is 
therefore of dimension at most d, i.e., 7^ ~ span(J/(r)). The vector t — f gives the 
coordinate of /(r) in the affine subspace /(r) + %■. Without knowing the function /, 
we can not explicitly compute the Jacobi matrix J/(t). However, if we know 7^ in 
terms of Qr, a matrix forming an orthonormal basis of 7^, we can write 



Furthermore, 



The mapping from r to 9* represents a local affinc transformation. This afhne 
transformation is unknown because we do not know the function /. The vector 9*, 
however, has an approximate 9t that orthogonally projects f{f) — /(t) onto 7^-, 



(3.2) 



QUfif)~f{r)) = 9; + 0{\\f-rr), 



provided Qr is known at each r. Ignoring the second-order term, the global coordinate 
T satisfies 

'"dr [ \\Pr{f ^ t) ~ 9r\\df 0. 
jn{T) 

Here ^{t) defines the neighborhood of r. Therefore, a natural way to approximate 
the global coordinate is to find a global coordinate r and a local affine transformation 
Pt that minimize the error function 



(3.3) 



dT 



\Pr{f- 



I dr. 



This represents a nonlinear alignment approach for the dimension reduction problem 
(this idea will be picked up at the end of section ||) . 
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On the other hand, a linear ahgnment approach can be devised as follows. If 
Jfir) is of full column rank, the matrix should be non-singular and 

f - T W p-^9r = LrOr. 

The above equation shows that the affine transformation Lr should align this local 
coordinate with the global coordinate r — f for /(r). Naturally we should seek to find 
a global coordinate r and a local affine transformation to minimize 



(3.4) / dr / \\f -T - LrOrWdf. 

The above amounts to matching the local geometry in the feature space. Notice that 
Or is defined by the "known" function value and the "unknown" orthogonal basis 
matrix Qr of the tangent space. It turns out, however, Qr can be approximately 
determined by certain function values. We will discuss this approach in the next 



section. Clearly, this linear approach is more readily applicable than (3.3). Obviously, 



If the manifold J- is not regular^ i.e., the Jacobi matrix Jf is not of full column rank 



at some points t (z C, then the two minimization problems (3.4) and (3.3) may lead 
to quite different solutions. 

As is discussed in the linear case, the low-dimensional feature vector r is not 
uniquely determined by the manifold J^. We can reparametrize using /((/(r)) 
where g{-) is a smooth 1-to-l onto mapping of C to itself. The parameterization of 
can be fixed by requiring that r has a uniform distribution over C. This will come 
up as a normalization issue in the next section. 

4. Feature Extraction through Alignment. Now we consider how to con- 
struct the global coordinates and local affine transformation when we are given a data 
set X — [xi, . . . , xn] sampled with noise from an underlying nonlinear manifold, 

Xi = /(n) + Ei, i = 1, . . . , AT, 

where e 7?.'^, Xi £ TZ™' with d < m. For each Xi, let Xi — [xi^, . . . , Xi^] be a 
matrix consisting of its k-neavest neighbors including Xi , say in terms of the Euclidean 
distance. Consider computing the best d-dimensional affine subspace approximation 
for the data points in A^, 

A; 

min } \\xi. - {x + Q9j)\\„ ^ mm 11 A^ - (cce^ -|- (59)|L , 

where Q is of d columns and is orthonormal, and O = [6i, ... ,6k]. As is discussed in 
section ^, the optimal x is given by Xi, the mean of all the Xi- 's and the optimal Q is 
given by Qi, the d left singular vectors of Xi{I — ee^/fc) corresponding to its d largest 
singular values, and 6 is given by 6^ defined as 



(4.5) e, = QjX,{I - -ee^) = [e['\ • • • , = Qf{x,^ - x,). 
Therefore we have 

(4.6) x^^ =x,+Q,0f 
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where ^j*' — {I — QiQj){xi- — Xi) denotes the reconstruction error. 

We now consider constructing the global coordinates t^, i = 1, . . . , TV, in the low- 
dimensional feature space based on the local coordinates 6]^ which represents the 
local geometry. Specifically, we want Ti- to satisfy the following set of equations, i.e., 

the global coordinates should respect the local geometry determined by the ^j'^ , 

(4.7) Ti^=fi+Lief j = l,...,k, i = l,...,N, 
where f, is the mean of Ti.,j = 1, . . . ,k. In matrix form, 

Ti = l-Tiee^ + Li&i + Ei, 
k 

where Tj = [ri^, . . . ,Ti^] and Ei = [e^i \ ■ ■ e^'^] is the local reconstruction error 
matrix, and we write 

(4.8) Ei=Ti{I-^ee^)-Liei. 

To preserve as much of the local geometry in the low-dimensional feature space, we 
seek to find and the local affine transformations Li to minimize the reconstruction 
errors ej'\ i.e., 

(4.9) Yl ii^^ii' = E - r""^^ " = '^'^ ■ 

i i 

Obviously, the optimal alignment matrix that minimizes the local reconstruction 
error for a fixed Tj, is given by 

Li = Ti{I - -1:66^)6+ = T,Q+, and therefore Ei = Ti{I - \ee^){I - e+Oi), 

where is the Moor-Pcnrosc generalized inverse of O^. Let T = [ri, . . . , rjv] and Si 
be the 0-1 selection matrix such that TSi = Ti,. We then need to find T to minimize 
the overall reconstruction error 



J2\\EirF = \\TSW\\%, 



where S = [Si,- ■ ■ , Sn], smdW = diag(Wi ,---,Wn) with 

(4.10) Wi = {I-^ee^){I-efei). 

To uniquely determine T, we will impose the constraints TT^ = I4, it turns out that 
the vector e of all ones is an eigenvector of 

(4.11) B = SWW'^S'^ 

corresponding to a zero eigenvalue, therefore, the optimal T is given by the d eigen- 
vectors of the matrix B, corresponding to the 2nd to d-|-lst smallest eigenvalues of 
B. 
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Remark. We now briefly discuss the nonlinear alignment idea mentioned in 
(3.3). In particular, in a neighborhood of a data point Xi consisting of data points 
Xi — [xij^, . . . ,Xif^], by first order Taylor expansion, we have 



X,{I - ee^/k) « J^pT,{I - ee^/fc). 



Let Si be the neighborhood selection matrix as defined before, we seek to find jj-*-* € 
j^rnxd g^^^ rp minimize 



N 



E{J,T) ^ 5] ||(X - jfT)S.{I-ee^/k)\\ 



2 



where J = . . . , j}^']. The LTSA algorithm can be considered as an approach to 

find an approximate solution to the above minimization problem. We can, however, 
seek to find the optimal solution of E{J, T) using an alternating least squares approach: 
fix J minimize E{J, T) with respect to T, and fix T minimize E{J, T) with respect to 
J, and so on. As an initial value to start the alternating least squares, we can use the 
T obtained from the LTSA algorithm. The details of the algorithm will be presented 
in a separate paper. 



Remark. The minimization problem (4.9) needs certain constraints (i.e., nor- 
malization conditions) to be well-posed, otherwise, one can just choose both Ti and 
Li to be zero. However, there are more than one way to impose the normalization 
conditions. The one we have selected, i.e., TT^ = Id, is just one of the possibilities. 
To illustrate the issue we look at the following minimization problem, 

minllAr-r^lli. 

The approach we have taken amounts to substituting Y = XA~^, and minimize ||X(/— 
A+A)||i? with the normalization condition XX^ = /. However, 



\X -YA\\f 



I 

-A 



F 

and we can minimize the above by imposing the normalization condition 

[X,Y][X,Yf =1. 

This nonuniqueness issue is closely related to to nonuniqueness of the parametrization 
of the nonlinear manifold /(r), which can reparametrized as /(r(7]))with a 1-to-l 
mapping T{ri). 

5. Constructing Principal Manifolds. Once the global coordinates are 
computed for each of the data points Xi, we can apply some non-parametric regression 
methods such as local polynomial regression to {(r^, Xi)}^i to construct the principal 
manifold underlying the set of points Xi. Here each of the component functions fj^r) 
can be constructed separately, for example, we have used the simple loess function 

in some of our experiments for generating the principal manifolds. 

In general, when the low-dimensional coordinates Ti are available, we can con- 
struct an mapping from the r-space (feature space) to the x-space (input space) as 
follows. 
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1. For each fixed r, let Ti be the nearest neighbor (i.e., ||t — Ti\\ < \\t — Tj\\, for 
j 7^ i). Define 

where fi be the mean of the feature vectors in a neighbor to which belong. 

2. Back in the input space, we define 

X = Xi + Qi6. 
Let us define hy g : t x the resulted mapping, 

(5.12) g{T)=x, + Q,LT\T-n). 

To distinguish the computed coordinates from the exact ones, in the rest of this 
paper, we denote by r* the exact coordinate, i.e., 

(5.13) =/(T*) + e*. 

Obviously, the errors of the reconstructed manifold represented by g depend on the 
sample errors e*, the local tangent subspace reconstruction errors and the align- 
ment errors t"^\ The following result show that this dependence is linear. 

Theorem 5.1. Let e* ^ x, - /(r*), (}f = (/ - QiQf)ixt - i,), and = 
Ti - fi - LiQf{xi - Xi). Then 

\\9{n)-f{r:)h<\\e*h + \M2 + \\L-h,h. 



Proof. Substituting L- ^(n — Ti) — L- ^e.; + Qf{xi — Xi) into ( 5.12 ) gives 
g{Ti) = Xi+ QiL^^in ~ fi) 

— Xi ^ QiQi i^i ^i) + Qi^i 

Because QiQf{xi — Xi) = Xi — Xi — we obtain that 



'lin) 



fiT:)+^-^ + Q.,L-'^. 



Therefore we have 



MT.)-f{r*)h<Mh + m2 + \\Lih.U 



completing the proof. □ 

In the next section, we will give a detailed error analysis to estimate the errors 
of alignment and tangent space approximation in terms of the noise, the geometric 
properties of the generating function / and the density of the generating coordinates 

T*. 
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6. Error Analysis. As is mentioned in the previous section, we assume that 
that the data points are generated by 

= /«) + £*, i = l,...,A. 

For each Xi , let Xi = [xi-^ , ■ • • , ] be a matrix consisting of its fc- nearest neighbors 
including Xi in terms of the Euclidean distance. Similar to Ei defined in (4.8), we 
denote by E* the corresponding local noise matrix, E* = [e*^ , . . . , e* ] . The low- 
dimensional embedding coordinate matrix computed by the LTSA algorithm is de- 
noted by T = . . . ,Tjv]. We first present a result that bounds ||£'i|| in terms of 
\\E*\\- 

Theorem 6.1. Assume T* = [rj*, . . . ,Tj^] satisfies {T*)^^* = Ud- Let Ti be the 
mean of Ti-^, ■ ■ ■ , Ti^, Denote Pi = Qj Jf{f*) and Hf^{f*) the Hessian matrix of the 
i-th component function of f . If the Pi 's are nonsingular, then 

\mF<\\p-'\\F{5, + \\E*\\p), 

where Si is defined by 

m k 



1=1 j=i 



Furthermore, if each neighborhood is of size 0{rj), then \\E\\ < \\P^ "^||i?||£'*|| -|-0(r/^). 
Proof. First by (4.S), we have 

(6.14) E, - T,{I - iee^) - L,e, = (T, - L,QJ X,){I - ^ee^). 

To represent Xi in terms of the Jacobi matrix of /, we assume that / is smooth 
enough and use Taylor expansion at f * , the mean of the k neighbors of t* , we have 

x^,^ f{j:) + MT:^~f:)+^^ + e,^, 

where Ji = Jf{f*) and Sj^^ represents the remainder term beyond the first order 
expansion, in particular, its l-th components can be approximately written as (using 
second order approximation), 

with the Hessian matrix Hf^{f*) of the ^-th component function fg of / at f*. We 
have in matrix form, 

X^ = /(f ;)e^ + .hT*{I - iee^) -f A, -f E* 
with Ai — [5^\ • • • , (J^*']. Multiplying by the centering matrix / — gives 

(6.15) X,(I - iee^) = {.hT* + A, + E*){I - ^ee^). 



Substituting ( 6.15 ) into ( 6.14 ) and denoting Pi — Ql Ji, we obtain that 
(6.16) E, = [T, - UP,T* - L,Qj{A, + E*)){I - ^ee^). 
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For any T satisfying the orthogonal condition TT'^ = Ij, and any Li, we also have the 
similar expression of ( 6.16| ) for T, and Zj. Note that T and Li, i = \, - ■ ■ ,N , minimize 
the overall reconstruction error, \\E\\p < ||-E||f- Setting T = T* and = we 
obtain the upper bound 

iii?,iiF<iiPr'ii2(iiA.iiF + iii?rii^^). 

We estimate the norm ||Ai||ir by ignoring the higher order terms, and obtain that 

771 k 

\ml<J2J2\\Hf^(^n\\lH~f:\\t^s', 
1=1 j=i 

completing the proof. □ 

The non-singularity of the matrix Pi requires that the Jacobi matrix Ji be of full 
column rank and the two subspaces span( J^) and the d largest left singular vector space 
span((5i) are not orthogonal to each other. We now give a quantitative measurement 
of the non-singularity of Pi. 

Theorem 6.2. Let ad{Ji) be the d-th singular value of Ji = JiT*{I — jce^), and 
denote ai — 4:{\\E*\\p + Si)/ad{Ji) with Si defined in Theorem \6. j| . Then 

\\Pr'\\F<{l + a^)'/\j+\\r. 



Proof. The proof is simple. Let Ji — UjYijVj be the SVD of the matrix Ji. By 

( 3.15 ) and perturbation bounds for singular subspaces (5[ Theorem 8.6.5], the singular 
vector matrix Qi can be expressed as 

(6.17) g, = {Uj + UjH)(I + HjH,)-^'^ 

with 

\m\F<^j-(\\E*\\F + \\^^\\F))<a,. 
^d{Ji) ^ ' 

where ad{Ji) is the d-largest singular value of Ji. On the other hand, from the SVD 
of Ji, we have JiT*Vj = UjYij, which gives 

j,^uj^j{t:v.j)~\ 

It follows that 

P^ = Q^J^ = {I + HjHi)-^'^i:j{T:V.,Y' = {I + H^H,)'^/^Ujj^. 
Therefore we have 

\\p-'\\F<ii + \miy/%jt\\F, 

completing the proof. □ 

The degree of non-singularity of Ji is determined by the curvature of the manifold 
and the rotation of the singular subspace is mainly affected by the sample noises 
Ej's and the neighborhood structure of x^'s. The above error bounds clearly show 
that reconstruction accuracy will suffer if the manifold underlying the data set has 
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singular or near-singular points. This phenomenon will be illustrated in the numerical 
examples in section ||. Finally, we give an error upper bound for the tangent subspace 
approximation. 

Theorem 6.3. Let cond(Ji) = ai{Ji) / ad{Ji) be the spectrum condition number 
of the d-column matrix Ji. Then 

\\X, - (x,e^ + Q.e,)||F < (l + 4(1 + a2)cond(J,)) (||-B*||f + 6,). 



Proof. By (6.15), we write 



(/ - Q^Qf)X{I - iee^) = (/ - Q,Qf)J, + A„ 



with ||Ai||i? < ||£^*||i? + (5i. To estimate — QiQf )Ji\\F, we use the expression (6.17) 
to obtain 



T 



Hi 
I 

Taking norms gives that 

\\{.I-Q:QI)Uf < il + \m\l)\\H,\\FU\\2<Hl + aME*\\F + 6^)condiJ,)^ 

The result required follows. □ 

The above results show that the accurate determination of the local tangent 
space is dependent on several factors: curvature information embedded in the Hessian 
matrices, local sampling density and noise level, and the regularity of the Jacobi 
matrix. 

7. Numerical Computation Issues. One major computational cost of LTSA 
involves the computation of the smallest eigenvectors of the symmetric positive semi- 
defined matrix B defined in (4.11). B in general will be quite sparse because of the 
local nature of the construction of the neighborhoods. Algorithms for computing a 
subset of the eigenvectors for large and/or sparse matrices are based on computing 
projections of B onto a sequence of Krylov subspaces of the form 

Kp{B,VQ) = span{uo,Bi;o,-B^uo, . . .,BP^^vo}, 

for some initial vectors vq . Hence the computation of matrix- vector multiplications 
Bx needs to be done efficiently. Because of the special nature of B, Bx can be 
computed neighborhood by neighborhood without explicitly forming B, 

Bx = SiWiW^S^x + ■■■ + SNWNWjiSjfX, 



where as defined in (4.10), 



Each term in the above summation only involves the Xi 's in one neighborhood. 
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The matrix Q^Qi in the right factor of Wi is the orthogonal projector onto the 
subspace spanned by the rows of (6,). If the SVD of X, - i^e^ = Q(*)s(^) (ij('))'^ is 
available, the orthogonal projector is given by QfQi = HiHf , where Hi is the sub- 
matrix of first d columns of i/^*). Otherwise, one can compute the QR decomposition 
0f — HiRi of 9f and obtain 6^6^ = HiHf ||]. Clearly, we have H^ e = because 
QiC — 0. Then we can rewrite Wi as 

W^^I- iee^ - H,Hj - / - [e/Vfc, H,][e/Vk, H,f = 1- G,GJ . 

It is a orthogonal projector onto the null space spanned by the rows of 6^ and /Vk. 
Therefore the matrix-vector product y = SiWiWf Sf x can be easily computed as 
follows: denote by li = {ii, • • • ,ik} the set of indices for the k nearest-neighbors of 
Xi , then the j-th element of y is zero for j ^ li and 

y{h) = x{h) - G,{Gfx{h)). 

Here y{Ii) = [Vii, ■ • • ,yikV denotes the section of y determined by the neighborhood 
set li. 

If one needs to compute the d smallest eigenvectors that are orthogonal to e 
by applying some eigen-solver with an explicitly formed B. The matrix B can be 
computed by carrying out a partial local summation as follows 

(7.18) B{h,h) ^ B{hJi)+I-G,Gj, i = l,---,N 

with initial B — Q. 

Now we are ready to present our Local Tangent Space Alignment (LTSA) algo- 
rithm. 



Algorithm LTSA (Local Tangent Space Alignment). Given N m- 
dimensional points sampled possibly with noise from an underlying d- 
dimensional manifold, this algorithm produces N d-dimensional coor- 
dinates T S TZ'^'^^ for the manifold constructed from k local nearest 
neighbors. 

Step 1. [Extracting local information.] For each i — 1, - ■ ■ , N, 

1.1 Determine k nearest neighbors Xi. oi Xi, j — I, . . . , k. 

1.2 Compute the d largest eigenvectors gi, - ■ ■ ,gd of the corre- 
lation matrix {Xi — XiC^)'^ {Xi — XiC^), and set 



G, = [e/Vk, 



gi, gd\ 



Step 2. [Constructing alingment matrix.] Form the matrix B by locally 
summing ( 7.18| ) if a direct eigen-solver will be used. Otherwise 



implement a routine that computes matrix- vector multiplication 
Bu for an arbitrary vector u. 
Step 3. [Aligning global cordinates.] Compute the d -I- 1 smallest eigen- 
vectors of B and pick up the eigenvector matrix [m2, • • • , u^+i] 
corresponding to the 2nd to (i+ 1st smallest eigenvalues, and set 
T = [m2, • • • , Ud+lY ■ 
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Fig. 1. sample data points with noise from various 1-D manifolds ( top) and coordinates of computed 
Ti via exact r* (bottom). 

8. Experimental Results. In this section, we present several numerical exam- 
ples to illustrate the performance of our LTSA algorithm. The test data sets include 
curves in 2D/3D Euclidean spaces, and surfaces in 3D Euclidean spaces. Especially, 
we take a closer look at the effects of singular points of a manifold and the interaction 
of noise levels and sample density. To show that our algorithm can also handle data 
points in high-dimensional spaces, we also consider curves and surfaces in Euclidean 
spaces with dimension equal to 100 and an image data set with dimension 4096. 

First we test our LTSA method for ID manifolds (curves) in both 2D and 3D. For 
a given ID manifold /(r) with uniformly sampled coordinates rj*, • • • ,t^ in a fixed 
interval, we add Gaussian noise to obtain the data set {xi} as follows, 

= fi^i) + '7randn(m, 1), 

where m = 2,3 is the dimension of the input space, and randn is Matlab's standard 
normal distribution. In Figure 1, in the first row from left to right, we plot the color- 
coded sample data points corresponding to the following three one- variable functions 

/(r) = (lOr, 10t3 + 2t2-10t)^, t e [-1, 1], = 0.1, 
/(r) = (tcos(t), Tsin(T))'^, t e [0, 47r], 77 = 0.2, 

/(r) = (3cos(t), 3sin(T), 3t)^, t g [0, 47r], = 0.2. 

In the second row, we plot t* against r^, where t^'s are the computed coordinates 
by LTSA. Ideally, the {T*,Ti) should form a straight line with either a 7r/4 or —tt/4 
slope. 

It is also important to better understand the failure modes of LTSA, and ulti- 
mately to identify conditions under which LTSA can truly uncover the hidden non- 
linear structures of the data points. As we have shown in the error analysis in section 
, it will be difficult to align the local tangent information if some of the Pi's 
defined in ( |3.2| ) are close to be singular. One effect of this is that the computed co- 
ordinates Ti and its neighbors may be compressed together. To clearly demonstrate 
this phenomenon, we consider the following function, 



/(r) = [cos3(r), sm\T)f, r G [0, tt]. 
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Fig. 2. 1-D manifolds with singular points (left) and corresponding coordinates Ti via exact t* 
(right). 



The Jacobi matrix (now a single vector since d = I) given by 
J/(r) = 1.5sin(2T)[-cos(T), sin(r)]^ 



is equal to zero at r = n/2. In that case the 6'-vector Qi defined in (4.5) will be 
computed poorly in the presence of noise. Usually the corresponding 0; will be small 
which also results in small Ti and the neighbors of Ti will also be small. In the first 
column of Fig 2, we plot the computed results for this 1-D curve. We see clearly near 
the singular point t — tt/2 the computed r^'s become very small, all compressed to 
a small interval around zero. In the second column of Fig 2, we examine another ID 
curve defined by 

/(r) = [lOcos(r), sin(r)]^, r G [7r/2, 37r/2]. 

We notice that similar phenomenon also occurs near the point t — n where the 
curvature of the curve is large, the computed t^'s near the corresponding point also 
become very small, clustering around zero. 

Next we look at the issues of the interaction of sampling density and noise levels. 
If there are large noises around /(r^) relative to the sampling density near /(r^), the 
resulting centered local data matrix Xi{I — \ee^) will not be able to provide a good 
local tangent space, i.e., Xi{I — will have singular values ad and dd+i that 

are close to each other. This will result in a nearly singular matrix Pi = QfJi, and 
when plotting r* against Ti , we will see the phenomenon of the computed coordinates 
Ti getting compressed, similar to the case when the generating function /(r) has 
singular and/or near-singular points. However, in this case, the result can usually be 
improved by increasing the number of neighbors used for producing the shifted matrix 
Xi{I — ^ee^). In Fig 3, we plot the computed results for the generating function 

/(r) = 3t^ + 2t2 - 2t, t e [-1.1, 1]. 

The data set is generated by adding noise in a relative fashion. 



Xi ^ /(Ti)(l -I- rjei) 
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Fig. 3. 1-D manifolds with different noise levels (top) and computed coordinates Ti vs. exact r* 
(bottom). 

with normally distributed Ci. The first three columns in Fig 3 correspond to the noise 
levels rj = 0.01, -q = 0.03, and 77 = 0.05, respectively. For the three data sets. We 
use the same number of neighbors, k — 10. With the increasing noise level 77, the 
computed r^'s get expressed at points with relatively large noise. The quality of the 
computed r^'s can be improved if we increase the number of neighbors as is shown on 
the column (d) in Fig 3. The improved result is for the same data set in column (c) 
with fc = 20 used. 

As we have shown in Fig 3 (column (d)), different neighborhood size k will produce 
different embedding results. In general, k should be chosen to match the sampling 
density, noise level and the curvature at each data points so as to extract an accurate 
local tangent space. Too few neighbors used may result in a rank-deficient tangent 
space leading to over-fitting, while too large a neighborhood will introduce too much 
bias and the computed tangent space will not match the local geometry well. It 
is therefore worthy of considering variable number of neighbors that are adaptively 
chosen at each data point. Fortunately, our LTSA algorithm seems to be less sensitive 
to the choice of k than LLE does as will be shown later. 

We now apply LTSA to a 2-D manifold embedded in a 100 dimensional space. 
The data points are generated as follows. First we generate N — 5000 3D points, 

Xi = {U, Si, h(ti,Si))'^ + O.Olrji 

with ti and uniformly distributed in the interval [—1, 1], the rji^s are standard 
normal. The h{t, s) is a peak function defined by 

h{t, s) = 0.3(1 - i)2e-*'-(-+i)' - (0.2i - t3 - s5)e-*'-^' - 0.1e-^'+^'>"-'\ 

This function is plotted in the left of Figure 4. We generate two kinds of data points 
xf and xf in lOOD space, 

X,^ — Qxi, Xi — Hxi, 

where Q is a random orthogonal matrix resulting in an orthogonal transformation 
and H a matrix with its singular values uniformly distributed in (0, 1) resulting in an 
affine transformation. Figure 4 plots the coordinates for x^ (middle) and xf (right). 
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(a) Cb) <o> 




Fig. 4. 2-D manifold in a 100-D space generated by 3-D peak function: (a) 3-D peak curve, (b) 
coordinates for orthogonal transformed manifold, (c) coordinates for affine transformed manifold. 




Fig. 5. Singular value ratios p^"'"^ (-\--dots) and p^' (--dots). 



One advantage of LTSA over LLE is that using LTSA we can potentially detect 
the intrinsic dimension of the underlying manifold by analyzing the local tangent space 
structure. In particular, we can examine the distribution of the singular values of the 
local data matrix Xi consisting of the data points in the neighborhood of each data 
point Xi. If the manifold is of dimension d, then Xi will be close to a rank-d matrix. 
We illustrate this point below. The data points are xf of the 2D peak manifold in 
the lOOD space. For each local data matrix Xi, let aj^i be the j-the singular value of 
the centered matrix Xi{I — ^ee^). Define the ratios 

0") ^ 

Pi 

In FigjH wc plot the rations p[^^ and pf \ It clearly shows the feature space should 
be 2-dimensional. 

Next, we discuss the issue of how to use the global coordinates r^'s as a means for 
clustering the data points x^'s. The situation is illustrated by Figure 6. The data set 
consists of three bivariate Gaussians with covariance matrices O.2/2 and mean vectors 
located at [1, 1], [1, — 1], [— 1, 0]. There are 100 sample points from each Gaussian. 
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Fig. 6. (left) Global coordinates by LLE, (middle) global coordinates by LTSA, (right) Three Gaus- 
sian data with principal curves 



The thick curve on the right panel represents the principal curve computed by LTSA 
and the thin curve by LLE. It is seen that the thick curve goes through each of the 
Gaussians in turn, and the corresponding global coordinates (plotted in the middle 
panel) clearly separate the three Gaussians. LLE did not perform as well, mixing two 
of the Gaussians. 

Last we look at the results of applying LTSA algorithm to the face image data 
set [Q. The data set consists of a sequence 698 64-by-64 pixel images of a face 
rendered under various pose and lighting conditions. Each image is converted to an 
m — 4096 dimensional image vector. We apply LTSA with k = 12 neighbors and 
d = 2. The constructed coordinates are plotted in the middle of Figure 7. We also 
extracted four paths along the boundaries of of the set of the 2D coordinates, and 
display the corresponding images along each path. It can be seen that the computed 
2D coordinates do capture very well the pose and lighting variations in a continuous 
way. 

9. Conclusions and Feature Works. In this paper, we proposed a new al- 
gorithm (LTSA) for nonlinear manifold learning and nonlinear dimension reduction. 
The key techniques we used are construction of local tangent spaces to represent local 
geometry and the global alignment of the local tangent spaces to obtain the global 
coordinate system for the underlying manifold. We provide some careful error anal- 
ysis to exhibit the interplay of approximation accuracy, sampling density, noise level 
and curvature structure of the manifold. In the following, we list several issues that 
deserve further investigation. 

1. To make LTSA (similarly LLE) more robust against noise, we need to resolve 
the issue where several of the smallest eigenvalues of B are about the same magnitude. 
This can be clearly seen when the manifold itself consists of several disjoint compo- 
nents. If this is the case, one needs to break the matrix B into several diagonal blocks, 
and apply LTSA to each of the block. However, with noise, the situation becomes 
far more complicated, several eigenvectors corresponding to near-zero eigenvalues can 
mix together, the information of the global coordinates seems to be contained in the 
eigen-subspace, but how to unscramble the eigenvectors to extract the global coor- 
dinate information needs more careful analysis of the eigenvector matrix of B and 
various models of the noise. Some preliminary results on this problem have been 
presented in |p^ . 

2. The selection of the set of points to estimate the local tangent space is very 
crucial to the success of the algorithm. Ideally, we want this set of points to be close 
to the tangent space. However, with noise and/or at the points where the curvature 
of the manifold is large, this is not an easy task. One line of ideas is to do some 
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Fig. 7. Coordinates computed by Algorithm LTSA with k = 12 neighbors (middle) and images 
corresponding to the points on the bound lines (top, bottom, left, and right) Left. 



preprocessing of the data points to construct some restricted local neighborhoods. 
For example, one can first compute the minimum Euclidean spanning tree for the 
data set, and restrict the neighbors of each point to those that are linked by the 
branches of the spanning tree. This idea has been applied in self-organizing map 
Ip^ . Another idea is to use iterative-refinement, combining the computed r^'s with 
the Xi^s for neighborhood construction in another round of nonlinear projection. The 
rationale is that r^'s as the computed global coordinates of the nonlinear manifold 
may give a better measure of the local geometry. 

3. A discrete version of the manifold learning can be formulated by considering the 
data points as the vertices of an undirected graph . A quantization of the global 
coordinates specifies the adjacency relation of those vertices, and manifold learning 
becomes discovering whether an edge should be created between a pair of vertices 
or not so that the resulting vertex neighbors resemble those of the manifold. We 
need to investigate a proper formulation of the problem and the related optimization 
methods. 

4. From a statistical point of view, it is also of great interest to investigate 
more precise formulation of the error model and the associated consistency issues and 
convergence rate as the sample size goes to infinity. The learn-ability of the nonlinear 
manifold also depends on the sampling density of the data points. Some of the results 
in non-parametric regression and statistical learning theory will be helpful to pursue 



20 



ZHENYUE ZHANG AND HONGYUAN ZHA 



this line of research. 
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